Introduction

The objective of this notebook is to build an automated human activity recognition system. Main goal is to obtain highest cross-validated activity prediction performance by applying various data preprocessing and machine learning methods and tuning their parameters.

Labeled human acitivity used in this study is publicly available on Kaggle[1].

Throughout this workbook, I will follow an iterative process where I will go back and forward to apply various data visualization, data preprocessing and model-training methods while paying special attention on:

  • training time,
  • testing time,
  • prediction performance

My goal is, eventually, to learn more about the nature of the problem of activity recognition. I will mostly have an application developer view when I discuss the real life implications of the obtained results.

[1] Davide Anguita, Alessandro Ghio, Luca Oneto, Xavier Parra and Jorge L. Reyes-Ortiz. A Public Domain Dataset for Human Activity Recognition Using Smartphones. 21st European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, ESANN 2013. Bruges, Belgium 24-26 April 2013 [ https://www.kaggle.com/uciml/human-activity-recognition-with-smartphones ]

In [78]:
import pandas as pd
from IPython.display import display # Allows the use of display() for DataFrames

class_labels = ['WALKING', 'WALKING_UPSTAIRS', 'WALKING_DOWNSTAIRS', 'SITTING', 'STANDING', 'LAYING']
class_ids = range(6)

X_train = pd.read_csv('train.csv')
s_train = X_train['subject']
X_train.drop('subject', axis = 1, inplace = True)
y_train = X_train['Activity'].to_frame().reset_index()
X_train.drop('Activity', axis = 1, inplace = True)
y_train = y_train.replace(class_labels, [0, 1, 2, 3, 4, 5])

X_test = pd.read_csv('test.csv')
s_test = X_test['subject']
X_test.drop('subject', axis = 1, inplace = True)
y_test = X_test['Activity'].to_frame().reset_index()
X_test.drop(['Activity'], axis = 1, inplace = True)
y_test = y_test.replace(class_labels, [0, 1, 2, 3, 4, 5])

#NOTE: this contenation method (viz. append) is safer than concat
X = X_train.append(X_test, ignore_index=True)
y = y_train.append(y_test, ignore_index=True)

print X.shape

display(X.describe())
# display(y.describe())

#NOTE: append can adjust the index value of the appended dataframe whereas concatenation cannot. Concatenation may
#result in duplicate indices.
# dataframes = [X_train, X_test]
# X = pd.concat(dataframes)
# dataframes = [y_train, y_test]
# y = pd.concat(dataframes)
(10299, 561)
tBodyAcc-mean()-X tBodyAcc-mean()-Y tBodyAcc-mean()-Z tBodyAcc-std()-X tBodyAcc-std()-Y tBodyAcc-std()-Z tBodyAcc-mad()-X tBodyAcc-mad()-Y tBodyAcc-mad()-Z tBodyAcc-max()-X ... fBodyBodyGyroJerkMag-meanFreq() fBodyBodyGyroJerkMag-skewness() fBodyBodyGyroJerkMag-kurtosis() angle(tBodyAccMean,gravity) angle(tBodyAccJerkMean),gravityMean) angle(tBodyGyroMean,gravityMean) angle(tBodyGyroJerkMean,gravityMean) angle(X,gravityMean) angle(Y,gravityMean) angle(Z,gravityMean)
count 10299.000000 10299.000000 10299.000000 10299.000000 10299.000000 10299.000000 10299.000000 10299.000000 10299.000000 10299.000000 ... 10299.000000 10299.000000 10299.000000 10299.000000 10299.000000 10299.000000 10299.000000 10299.000000 10299.000000 10299.000000
mean 0.274347 -0.017743 -0.108925 -0.607784 -0.510191 -0.613064 -0.633593 -0.525697 -0.614989 -0.466732 ... 0.126708 -0.298592 -0.617700 0.007705 0.002648 0.017683 -0.009219 -0.496522 0.063255 -0.054284
std 0.067628 0.037128 0.053033 0.438694 0.500240 0.403657 0.413333 0.484201 0.399034 0.538707 ... 0.245443 0.320199 0.308796 0.336591 0.447364 0.616188 0.484770 0.511158 0.305468 0.268898
min -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 ... -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000
25% 0.262625 -0.024902 -0.121019 -0.992360 -0.976990 -0.979137 -0.993293 -0.977017 -0.979064 -0.935788 ... -0.019481 -0.536174 -0.841847 -0.124694 -0.287031 -0.493108 -0.389041 -0.817288 0.002151 -0.131880
50% 0.277174 -0.017162 -0.108596 -0.943030 -0.835032 -0.850773 -0.948244 -0.843670 -0.845068 -0.874825 ... 0.136245 -0.335160 -0.703402 0.008146 0.007668 0.017192 -0.007186 -0.715631 0.182028 -0.003882
75% 0.288354 -0.010625 -0.097589 -0.250293 -0.057336 -0.278737 -0.302033 -0.087405 -0.288149 -0.014641 ... 0.288960 -0.113167 -0.487981 0.149005 0.291490 0.536137 0.365996 -0.521503 0.250790 0.102970
max 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 ... 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000

8 rows × 561 columns

In [79]:
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

# print np.histogram(y['Activity'], bins=len(class_ids))

plt.rcParams['figure.figsize'] = (12.0, 4.0)
plt.xticks(class_ids, class_labels)
plt.hist(y['Activity'], bins=np.arange(len(class_ids)+1)-0.5)
plt.title("Class Distribution")
plt.xlabel("Labels")
plt.ylabel("Frequency")
plt.show()

Iteration 1: Comparison of baseline classifiers

Before going into a more detailed work on the features and model training and testing, I will apply some of the supervised machine learning methods to have some idea about the baseline performance of those methods.

In [80]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import AdaBoostClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn import cross_validation
from sklearn.metrics import precision_recall_fscore_support
from time import time

from sklearn.metrics import classification_report

import warnings
warnings.filterwarnings('ignore')

def train(clf, features, target):
    start = time()
    clf.fit(features, target)
    end = time()
    return end - start

def predict(clf, features):
    start = time()
    pred = clf.predict(features)
    end = time()
    return end - start, pred

clf_SGD = SGDClassifier(random_state = 42)
clf_Ada = AdaBoostClassifier(random_state = 42)
clf_DTR = DecisionTreeRegressor(random_state=42)
clf_KNC = KNeighborsClassifier()
clf_GNB = GaussianNB()
clf_SVM = SVC()

clfs = {clf_SGD, clf_Ada, clf_DTR, clf_KNC, clf_GNB, clf_SVM}

y_train_ = y_train['Activity']
y_test_ = y_test['Activity']
y_ = y['Activity']

for clf in clfs:
    printout = ""
    if clf == clf_SGD: printout = "SGD"
    elif clf == clf_Ada: printout = "Ada"
    elif clf == clf_DTR: printout = "DTR"
    elif clf == clf_KNC: printout = "KNC"
    elif clf == clf_GNB: printout = "GNB"
    elif clf == clf_SVM: printout = "SVM"

    results_precision = []
    results_recall = []
    results_fscore = []
    results_ttrain = []
    results_ttest = []
    kfold = cross_validation.KFold(X.shape[0], n_folds=10, shuffle=False, random_state=42)
    for train, test in kfold:
        start = time()
        clf.fit(X.iloc[train], y_[train])
        results_ttrain.append(time()-start)
        #NOTE: for some resason this line doesn't work.
#         t_train = train(clf, X.iloc[train], y_[train])
        t_test, y_pred = predict(clf, X.iloc[test])
        results_ttest.append(t_test)
        precision, recall, fscore, support = precision_recall_fscore_support(y_[test], y_pred, average='weighted')
        results_precision.append(precision)
        results_recall.append(recall)
        results_fscore.append(fscore)        
        
    printout += "  precision: {:.2f}".format(np.mean(results_precision))
    printout += "  recall: {:.2f}".format(np.mean(results_recall))
    printout += "  fscore: {:.2f}".format(np.mean(results_fscore))
    printout += "  t_train: {:.4f}sec".format(np.mean(results_ttrain))
    printout += "  t_pred: {:.4f}sec".format(np.mean(results_ttest))
    print printout
KNC  precision: 0.91  recall: 0.91  fscore: 0.91  t_train: 0.6835sec  t_pred: 9.0054sec
GNB  precision: 0.80  recall: 0.73  fscore: 0.72  t_train: 0.1849sec  t_pred: 0.0448sec
SVM  precision: 0.94  recall: 0.94  fscore: 0.93  t_train: 9.9764sec  t_pred: 2.4184sec
DTR  precision: 0.88  recall: 0.87  fscore: 0.87  t_train: 5.2582sec  t_pred: 0.0017sec
SGD  precision: 0.95  recall: 0.94  fscore: 0.94  t_train: 0.3448sec  t_pred: 0.0016sec
Ada  precision: 0.37  recall: 0.54  fscore: 0.41  t_train: 30.7475sec  t_pred: 0.0321sec

Iteration 2: Having a closer look at the features

SVM and SGD have the highest precision, recall and f1-score. SGD is the quickest in prediction and second quickest in training. I will be using SVM as the main method while discovering different feature processing methods and at the and I will compare SGD with SVM.

Although SVM's cross-validated classification performance is already very high (p=0.94, r=0.94, f=0.93), further investigation might still yield in even a higher classification performance. For instance, removing the outliers is one way to improve the model. We can't visualize a 561-dimensional space in a human readable form but we can still have a look at how the features are distributed individually. I will plot the distribution of some of the features below.

Moreover, some of the features might be redundant. Redundant features can easily be distinguished by investigating the correlation between them and the other features. If there is high correlation with other features, that means there is no reason to carry this feature as the information represented by this feature is already conveyed through other features.

Therefore, correlation matrix will allow us to see the distribution of the feature values individually and correlation between the features as shown below. I will use SelectKBest method to choose a subset of features for further investigation. In order to decide on the number K, I will run an exhaustive training batch where I change the number K and monitor che change in the cross-validated prediction performance of the SVM model.

In [83]:
from sklearn.feature_selection import SelectKBest

d_kbest_to_precision = {}
d_kbest_to_recall = {}
d_kbest_to_f1score = {}

#TODO: switch back before submitting the workbook
# kbest_max = X.shape[1]/5
kbest_max = 18
clf = clf_SVM

for kbest in range (2, kbest_max):
    f_selector = SelectKBest(k=kbest)
    Xs = f_selector.fit(X, y_).transform(X)
    printout = "kbest: {:3d}".format(kbest)

    results_precision = []
    results_recall = []
    results_fscore = []
    results_ttrain = []
    results_ttest = []
    kfold = cross_validation.KFold(Xs.shape[0], n_folds=4, shuffle=False, random_state=42)
    for train, test in kfold:
        start = time()
        clf.fit(Xs[train], y_[train])
        results_ttrain.append(time()-start)
        #NOTE: for some resason this line doesn't work.
#         t_train = train(clf, X.iloc[train], y_[train])
        t_test, y_pred = predict(clf, Xs[test])
        results_ttest.append(t_test)
        precision, recall, fscore, support = precision_recall_fscore_support(y_[test], y_pred, average='weighted')
        results_precision.append(precision)
        results_recall.append(recall)
        results_fscore.append(fscore)        
        
    printout += "  precision: {:.2f}".format(np.mean(results_precision))
    printout += "  recall: {:.2f}".format(np.mean(results_recall))
    printout += "  fscore: {:.2f}".format(np.mean(results_fscore))
    printout += "  t_train: {:.3f}sec".format(np.mean(results_ttrain))
    printout += "  t_pred: {:.3f}sec".format(np.mean(results_ttest))
    print printout
    
    d_kbest_to_precision[kbest]=np.mean(results_precision)
    d_kbest_to_recall[kbest]=np.mean(results_recall)
    d_kbest_to_f1score[kbest]=np.mean(results_fscore)
kbest:   2  precision: 0.73  recall: 0.69  fscore: 0.67  t_train: 0.649sec  t_pred: 0.383sec
kbest:   3  precision: 0.73  recall: 0.70  fscore: 0.68  t_train: 0.658sec  t_pred: 0.394sec
kbest:   4  precision: 0.74  recall: 0.70  fscore: 0.69  t_train: 0.709sec  t_pred: 0.411sec
kbest:   5  precision: 0.73  recall: 0.70  fscore: 0.68  t_train: 0.724sec  t_pred: 0.441sec
kbest:   6  precision: 0.74  recall: 0.72  fscore: 0.70  t_train: 0.688sec  t_pred: 0.421sec
kbest:   7  precision: 0.74  recall: 0.72  fscore: 0.71  t_train: 0.731sec  t_pred: 0.454sec
kbest:   8  precision: 0.75  recall: 0.73  fscore: 0.71  t_train: 0.757sec  t_pred: 0.469sec
kbest:   9  precision: 0.76  recall: 0.73  fscore: 0.72  t_train: 0.759sec  t_pred: 0.459sec
kbest:  10  precision: 0.78  recall: 0.75  fscore: 0.74  t_train: 0.740sec  t_pred: 0.454sec
kbest:  11  precision: 0.81  recall: 0.79  fscore: 0.78  t_train: 0.635sec  t_pred: 0.389sec
kbest:  12  precision: 0.82  recall: 0.79  fscore: 0.77  t_train: 0.645sec  t_pred: 0.403sec
kbest:  13  precision: 0.82  recall: 0.80  fscore: 0.78  t_train: 0.656sec  t_pred: 0.409sec
kbest:  14  precision: 0.83  recall: 0.80  fscore: 0.78  t_train: 0.655sec  t_pred: 0.427sec
kbest:  15  precision: 0.83  recall: 0.80  fscore: 0.79  t_train: 0.667sec  t_pred: 0.427sec
kbest:  16  precision: 0.85  recall: 0.82  fscore: 0.80  t_train: 0.652sec  t_pred: 0.425sec
kbest:  17  precision: 0.85  recall: 0.82  fscore: 0.81  t_train: 0.670sec  t_pred: 0.445sec
In [84]:
plt.rcParams['figure.figsize'] = (20.0, 8.0)
plt.grid(True)
major_ticks = np.arange(0, kbest_max, 20) 
minor_ticks = np.arange(0, kbest_max, 5)

# ax.set_xticks(major_ticks)                                                       
# ax.set_xticks(minor_ticks, minor=True) 
plt.xticks(minor_ticks)
plt.plot(d_kbest_to_precision.keys(), d_kbest_to_precision.values(), 'r',
        d_kbest_to_recall.keys(), d_kbest_to_recall.values(), 'g',
        d_kbest_to_f1score.keys(), d_kbest_to_f1score.values(), 'b')
plt.show()

Precision, recall and fscore values are calculated based on the class weighted average as there is imbalance between the number of class labels in the dataset.

I will consider the best 16 features to further investigate. This is where the classification scores peak for the first time, and it doesn't change that much after that point on.[NOTE: ADD LEGEND TO THE PLOT]

In [85]:
kbest_selected = 16
f_selector = SelectKBest(k=kbest_selected)
f_selector.fit(X, y['Activity'])
f_selected_indices = f_selector.get_support(indices=False)
Xs_cols = X.columns[f_selected_indices]
Xs = X[Xs_cols] # dataset with selected features
# display(Xs.describe())

Having normally distributed features is the fundamental assumption in many predictive models. Normal distribution is un-skewed. It means the probability of falling in the right or left side of the mean is equally likely. As we can see from the correlation matrix above, and the skewness test below, these features are quite skewed, even mostly bimodal. [NOTE: FURTHER DISCUSSION IS NEEDED]

In [86]:
import scipy.stats.stats as st
import operator

skness = st.skew(X)

d_feature2skew = {}
for skew, feature in zip(skness , X.columns.values.tolist()):
    d_feature2skew[feature]=skew
    
feature2skew = sorted(d_feature2skew.items(), key=operator.itemgetter(1), reverse=True)
# for key, value in feature2skew:
#     print str(value) + " " + str(key)
    
d_feature2absskew = {}
for key, value in feature2skew:
    d_feature2absskew[key]=abs(value)

feature2absskew = sorted(d_feature2absskew.items(), key=operator.itemgetter(1), reverse=False)
cnt = 0
for key, value in feature2absskew:
    printout = "{:2d}".format(cnt)
    for col_name in Xs_cols:
        if col_name == str(key):
            printout += "*"
            break
    printout += " " + str(value) + " " + str(key)
    print printout
    cnt += 1
 0 0.00215377964944 fBodyGyro-meanFreq()-X
 1 0.00414219174709 tBodyGyroJerk-arCoeff()-Y,2
 2 0.00706628542013 tBodyGyro-arCoeff()-Z,4
 3 0.00992216217266 tBodyGyroJerk-correlation()-X,Y
 4 0.0121648766824 fBodyGyro-entropy()-Z
 5 0.0132529765856 tBodyAccJerk-mean()-Y
 6 0.0135294982223 fBodyGyro-entropy()-Y
 7 0.0140510581956 tBodyGyro-correlation()-Y,Z
 8 0.0141803927318 tBodyAccJerk-arCoeff()-Z,2
 9 0.0151974117498 fBodyGyro-meanFreq()-Y
10 0.0174189176525 angle(tBodyGyroJerkMean,gravityMean)
11 0.0184313807779 tBodyGyro-arCoeff()-X,4
12 0.0190788744894 angle(tBodyAccJerkMean),gravityMean)
13 0.0202594286111 tBodyAcc-entropy()-Z
14 0.0237067101828 tBodyGyro-arCoeff()-Y,2
15 0.0240181996008 tBodyAccJerk-arCoeff()-X,1
16 0.0267752303541 tBodyAcc-arCoeff()-Z,4
17 0.0292376915742 fBodyAccJerk-meanFreq()-X
18 0.0313683663163 fBodyBodyAccJerkMag-meanFreq()
19 0.0324442156276 tBodyAccJerk-arCoeff()-Y,2
20 0.0327374255746 tBodyGyroJerkMag-arCoeff()3
21 0.0328805491166 tBodyGyroJerk-entropy()-Z
22 0.0339294282868 fBodyAccMag-meanFreq()
23 0.0341555811287 angle(tBodyGyroMean,gravityMean)
24 0.0347704422585 fBodyAccJerk-maxInds-Z
25 0.0366040420095 tBodyGyroJerk-arCoeff()-X,2
26 0.0367363850759 tBodyAccMag-arCoeff()4
27 0.0367363850759 tGravityAccMag-arCoeff()4
28 0.0371594517177 tBodyAccJerk-correlation()-Y,Z
29 0.0433351711594 fBodyBodyGyroMag-entropy()
30 0.0434904027503 tBodyGyro-mean()-Z
31 0.0443626711323 tBodyAccJerk-correlation()-X,Z
32 0.0454304423713 tBodyGyroMag-arCoeff()4
33 0.0464489387411 fBodyGyro-entropy()-X
34 0.0476021732207 tBodyGyroJerk-mean()-Y
35 0.0488936753149 tBodyAcc-arCoeff()-Y,4
36 0.0519532219898 tBodyGyroJerk-mean()-Z
37 0.0545814665646 tBodyGyro-arCoeff()-Y,3
38 0.0552462217102 tBodyAcc-entropy()-X
39 0.0557854012593 tBodyAccMag-arCoeff()1
40 0.0557854012593 tGravityAccMag-arCoeff()1
41 0.0580587212785 fBodyGyro-meanFreq()-Z
42 0.0613778913045 angle(tBodyAccMean,gravity)
43 0.0630934868348 tBodyGyro-arCoeff()-Z,2
44 0.0632301422719 tBodyGyroJerk-arCoeff()-Z,2
45 0.0646220364334 tBodyAccJerk-arCoeff()-Y,1
46 0.0700384546029 fBodyAcc-entropy()-Y
47 0.0710436767198 tBodyGyroMag-arCoeff()1
48* 0.0741980322368 tBodyAccJerk-entropy()-Y
49 0.0757464261295 tBodyGyroJerk-entropy()-X
50 0.0760363198177 tBodyGyroJerk-correlation()-X,Z
51* 0.0827741497792 tBodyAccJerkMag-entropy()
52 0.0829869378769 tBodyAccJerk-mean()-Z
53 0.0853114441915 fBodyAccMag-entropy()
54 0.0907095754366 fBodyBodyGyroMag-meanFreq()
55 0.0921940628055 fBodyAcc-meanFreq()-Y
56 0.0939906319389 tBodyGyro-arCoeff()-X,2
57 0.0953001247937 tBodyAccJerkMag-arCoeff()4
58 0.0960118554825 tBodyAccJerk-arCoeff()-X,2
59 0.098224881132 tBodyGyroMag-arCoeff()3
60 0.0991962249851 tBodyAcc-arCoeff()-X,1
61 0.099979327303 tGravityAcc-arCoeff()-Y,1
62 0.101139651259 tBodyAccJerk-correlation()-X,Y
63 0.101799399436 tBodyGyroJerk-arCoeff()-Y,4
64 0.10300178995 tBodyAcc-entropy()-Y
65* 0.10593815062 tBodyAccJerk-entropy()-X
66 0.106786202227 tBodyGyroJerk-entropy()-Y
67 0.112677206662 tBodyAccJerk-mean()-X
68 0.114114131131 tBodyAcc-arCoeff()-X,4
69 0.114721730966 tBodyGyroJerkMag-entropy()
70 0.118471361957 tBodyAcc-arCoeff()-Y,1
71 0.118747530895 tBodyAccJerkMag-arCoeff()3
72 0.121445179385 fBodyBodyGyroJerkMag-entropy()
73 0.121858685513 tBodyAccMag-arCoeff()3
74 0.121858685513 tGravityAccMag-arCoeff()3
75* 0.125307722428 fBodyAcc-entropy()-X
76 0.130851232936 tBodyAcc-arCoeff()-Z,1
77 0.136991541129 fBodyAcc-entropy()-Z
78 0.137409984693 tGravityAcc-arCoeff()-Y,2
79 0.137507565857 tBodyGyroJerk-arCoeff()-Z,4
80 0.139300306473 tBodyGyroJerk-mean()-X
81 0.147160728856 fBodyAcc-meanFreq()-X
82 0.149653463758 tBodyGyroJerk-arCoeff()-Y,3
83 0.154195552001 tBodyAccMag-arCoeff()2
84 0.154195552001 tGravityAccMag-arCoeff()2
85 0.154812849149 fBodyAccJerk-meanFreq()-Y
86 0.155470925423 tBodyAccMag-entropy()
87 0.155470925423 tGravityAccMag-entropy()
88 0.158932692684 tBodyGyroJerk-correlation()-Y,Z
89 0.159303581704 tBodyGyro-mean()-X
90 0.163859158844 tBodyAcc-arCoeff()-Y,3
91 0.164544722854 tBodyGyro-arCoeff()-Y,1
92* 0.165858145244 tBodyAccJerk-entropy()-Z
93 0.16913457338 tBodyGyro-arCoeff()-Z,1
94 0.170575928287 tBodyGyro-correlation()-X,Z
95 0.175181130788 tBodyAccJerk-arCoeff()-Z,1
96 0.178335455758 tBodyGyro-arCoeff()-Y,4
97 0.180856348488 tBodyGyroJerk-arCoeff()-Z,1
98 0.186965065231 tGravityAcc-correlation()-Y,Z
99 0.188543153359 fBodyAccJerk-maxInds-X
100* 0.193019223009 fBodyAccJerk-entropy()-Y
101 0.193645469357 tBodyGyro-mean()-Y
102 0.193893906745 tGravityAcc-arCoeff()-Z,1
103 0.194580580474 tBodyGyroMag-arCoeff()2
104 0.19922042653 tBodyGyroJerk-arCoeff()-X,4
105* 0.200122462248 fBodyAccJerk-entropy()-X
106 0.204214234218 fBodyAccJerk-meanFreq()-Z
107 0.204612460278 tBodyAccJerk-arCoeff()-X,3
108 0.208069769381 tBodyAcc-correlation()-X,Z
109 0.213666571165 fBodyAcc-meanFreq()-Z
110 0.214095214436 tBodyGyro-entropy()-X
111 0.214785397225 tGravityAcc-correlation()-X,Z
112 0.215444900162 tBodyAccJerk-arCoeff()-Y,4
113 0.217113781355 tGravityAcc-arCoeff()-Z,2
114 0.219865381579 tBodyGyroJerk-arCoeff()-X,3
115 0.221751063028 tBodyGyroJerk-arCoeff()-X,1
116 0.222750998528 tBodyGyro-correlation()-X,Y
117* 0.225381717128 fBodyBodyAccJerkMag-entropy()
118 0.22843411364 tBodyAccJerk-arCoeff()-X,4
119 0.228733434969 tBodyAccJerkMag-arCoeff()1
120 0.230064018175 tBodyGyro-arCoeff()-Z,3
121 0.231408007738 tBodyGyroJerk-arCoeff()-Y,1
122 0.244053232175 tGravityAcc-arCoeff()-Y,3
123 0.2443306184 tBodyGyroJerk-arCoeff()-Z,3
124 0.253080762541 tBodyAccJerk-arCoeff()-Z,4
125 0.254161363298 tBodyGyro-entropy()-Z
126 0.261247012195 tBodyGyro-arCoeff()-X,1
127 0.268626982874 tGravityAcc-arCoeff()-Z,3
128 0.268717083068 tGravityAcc-sma()
129* 0.269810028483 fBodyAccJerk-entropy()-Z
130 0.273219057633 tBodyAcc-arCoeff()-Y,2
131 0.277009317899 tBodyAcc-arCoeff()-X,3
132 0.28403249056 tBodyGyro-arCoeff()-X,3
133 0.289495524163 tBodyAccJerk-arCoeff()-Y,3
134 0.289900529157 tBodyGyroJerkMag-arCoeff()2
135 0.291933023627 tBodyGyro-entropy()-Y
136 0.295128885078 tBodyAcc-arCoeff()-Z,3
137 0.303321162769 tBodyGyroMag-entropy()
138 0.304452344728 tBodyAccJerk-arCoeff()-Z,3
139 0.322390135742 tBodyGyroJerkMag-arCoeff()4
140 0.329196303405 tGravityAcc-arCoeff()-X,1
141 0.336804681174 tGravityAcc-arCoeff()-Z,4
142 0.338993633106 fBodyBodyGyroJerkMag-meanFreq()
143 0.342579661878 tBodyAcc-correlation()-Y,Z
144 0.343362509718 tBodyAcc-arCoeff()-X,2
145 0.389602855112 tBodyAcc-sma()
146 0.396042671642 tGravityAcc-correlation()-X,Y
147 0.396432265548 tBodyAccJerkMag-arCoeff()2
148 0.398682156012 tGravityAcc-arCoeff()-Y,4
149 0.408332492233 tBodyAccMag-mean()
150 0.408332492233 tGravityAccMag-mean()
151 0.408332492233 tGravityAccMag-sma()
152 0.408332492233 tBodyAccMag-sma()
153 0.413061698455 tGravityAcc-arCoeff()-X,2
154 0.417992043488 tBodyAcc-arCoeff()-Z,2
155 0.429023069479 tBodyAcc-mean()-Y
156 0.43082581152 tBodyGyroJerkMag-arCoeff()1
157 0.435232982925 tBodyAcc-std()-Y
158 0.436261160742 tBodyAcc-mad()-Y
159 0.443094366228 fBodyAcc-std()-Y
160 0.453971605378 fBodyAcc-skewness()-X
161 0.469180783726 fBodyAcc-sma()
162 0.477263627038 fBodyAcc-mad()-Y
163 0.477907809184 fBodyAccJerk-maxInds-Y
164 0.493217019657 fBodyAcc-mean()-Y
165 0.508043011358 tGravityAcc-arCoeff()-X,3
166 0.508417501483 fBodyGyro-skewness()-X
167 0.51320896886 tBodyGyro-sma()
168 0.516584338001 tBodyGyroMag-sma()
169 0.516584338001 tBodyGyroMag-mean()
170 0.541058634127 tBodyAcc-iqr()-Y
171 0.557655266106 tBodyAccMag-max()
172 0.557655266106 tGravityAccMag-max()
173 0.564614326445 tBodyAcc-correlation()-X,Y
174 0.582602194063 tGravityAcc-arCoeff()-X,4
175 0.583765832366 tBodyAcc-max()-Y
176 0.588132924519 fBodyAcc-skewness()-Z
177 0.589216542858 tBodyAcc-min()-X
178 0.593972814384 tBodyAccJerkMag-mean()
179 0.593972814384 tBodyAccJerkMag-sma()
180 0.60133410791 tBodyAccJerk-sma()
181* 0.601852530578 tBodyAcc-max()-X
182 0.608103538332 fBodyAcc-mad()-X
183 0.611575881537 tBodyAccJerk-mad()-Y
184 0.612580907669 fBodyGyro-skewness()-Z
185 0.614430566037 fBodyBodyAccJerkMag-skewness()
186 0.617282993415 tBodyAccMag-std()
187 0.617282993415 tGravityAccMag-std()
188 0.617651845516 fBodyAcc-mean()-X
189 0.622762775737 fBodyAccJerk-sma()
190 0.623552618043 tGravityAccMag-mad()
191 0.623552618043 tBodyAccMag-mad()
192 0.627176518078 tBodyAcc-mad()-Z
193 0.634672943249 fBodyAccJerk-mean()-Y
194* 0.63692184831 tBodyAcc-std()-X
195 0.641100176633 fBodyAcc-max()-Y
196 0.642669231697 tBodyAccJerk-std()-Y
197 0.651217278251 tBodyAcc-iqr()-Z
198 0.651507095393 fBodyAccMag-std()
199 0.65252508798 fBodyAccMag-mean()
200 0.65252508798 fBodyAccMag-sma()
201 0.653564973792 fBodyGyro-sma()
202 0.660807503434 fBodyAccMag-mad()
203 0.66093300966 tBodyAcc-std()-Z
204 0.661588792698 tBodyAccJerk-iqr()-Y
205 0.661803542963 tBodyAccJerk-mad()-X
206 0.665097401768 tBodyAccJerk-std()-X
207 0.669183198287 fBodyAccJerk-mad()-Y
208 0.673583783075 fBodyAccJerk-mad()-X
209 0.675106725061 tBodyAccJerkMag-std()
210 0.675303264928 fBodyAccJerk-std()-X
211 0.676185698675 fBodyAcc-std()-X
212 0.678208347527 tBodyAccJerkMag-mad()
213 0.679540466169 fBodyBodyAccJerkMag-mean()
214 0.679540466169 fBodyBodyAccJerkMag-sma()
215 0.684484457536 tBodyAcc-mad()-X
216 0.686560508799 fBodyAccJerk-std()-Y
217 0.689904357473 fBodyAccJerk-mean()-X
218 0.692754520232 fBodyBodyGyroJerkMag-skewness()
219 0.696713240294 fBodyBodyAccJerkMag-mad()
220 0.698429267795 fBodyAcc-mad()-Z
221 0.699368882229 fBodyGyro-mean()-Z
222 0.706242466919 tBodyAccJerkMag-max()
223 0.709445199772 fBodyGyro-mad()-Z
224 0.710484576166 tGravityAcc-max()-Z
225 0.714516797617 fBodyAcc-std()-Z
226 0.714933578877 tGravityAcc-mean()-Z
227 0.716048380986 tBodyAccJerk-min()-X
228 0.717970781452 tGravityAcc-min()-Z
229 0.719182210998 tBodyAccJerk-iqr()-X
230 0.725043436567 fBodyBodyAccJerkMag-std()
231 0.72519230022 tBodyAccJerkMag-iqr()
232 0.731562470842 fBodyGyro-mad()-X
233 0.734125201145 tBodyAccMag-iqr()
234 0.734125201145 tGravityAccMag-iqr()
235 0.736326449457 fBodyAcc-mean()-Z
236 0.738337523911 fBodyGyro-skewness()-Y
237 0.739820217699 fBodyGyro-mean()-X
238 0.740168755755 fBodyAcc-iqr()-Y
239 0.742596813893 fBodyAccJerk-iqr()-Y
240 0.755145200486 tBodyAcc-min()-Y
241 0.759804971223 tBodyGyroMag-mad()
242 0.767551861107 tBodyGyroMag-iqr()
243 0.768802271761 fBodyBodyAccJerkMag-iqr()
244 0.7778882466 tBodyGyro-std()-Z
245 0.787113694259 fBodyBodyGyroMag-skewness()
246 0.793890804483 fBodyAccJerk-skewness()-X
247 0.795330868908 tBodyGyro-std()-X
248 0.80030957345 tBodyGyro-mad()-X
249 0.803066493061 fBodyAcc-iqr()-X
250 0.807857692993 fBodyAccJerk-iqr()-X
251 0.810395438671 tBodyAcc-max()-Z
252 0.810580413835 tBodyGyroMag-std()
253 0.819382972131 fBodyBodyGyroMag-mad()
254 0.839751900174 fBodyAccJerk-max()-X
255 0.842219193579 tBodyGyro-mad()-Z
256 0.852631117761 tBodyAcc-iqr()-X
257 0.853000598265 fBodyAccMag-max()
258 0.854968192851 fBodyAccMag-iqr()
259 0.855588638168 fBodyAccJerk-max()-Y
260 0.859952417894 tBodyGyroJerk-mad()-X
261 0.863258341578 tBodyGyroMag-max()
262 0.864427916953 tBodyGyro-iqr()-X
263 0.865735592061 tGravityAcc-entropy()-X
264 0.868902058375 fBodyGyro-std()-X
265 0.876383470075 fBodyAcc-max()-X
266 0.878483286913 tBodyGyroJerk-std()-X
267 0.88019395377 tBodyAccJerk-min()-Y
268 0.885562703711 fBodyBodyAccJerkMag-max()
269 0.885977202538 tBodyGyroJerk-mad()-Z
270 0.887792803809 tBodyGyroJerk-iqr()-Z
271 0.891713039859 fBodyGyro-std()-Z
272 0.894074155545 tBodyGyroJerk-iqr()-X
273 0.896579765667 fBodyBodyGyroMag-std()
274 0.897149099437 fBodyBodyGyroMag-sma()
275 0.897149099437 fBodyBodyGyroMag-mean()
276 0.899971132243 tBodyGyro-max()-Z
277 0.90709209691 angle(Z,gravityMean)
278 0.921737927362 fBodyAcc-kurtosis()-X
279 0.930881345884 tBodyGyro-min()-Z
280 0.932021451793 tBodyGyroJerk-sma()
281 0.956959816952 fBodyAcc-max()-Z
282 0.95958198408 tBodyGyroJerk-std()-Z
283 0.974333508872 fBodyGyro-iqr()-X
284 0.976344174031 tBodyGyro-mad()-Y
285 0.981398684588 fBodyAccJerk-skewness()-Y
286 0.986389047812 tBodyGyroJerkMag-sma()
287 0.986389047812 tBodyGyroJerkMag-mean()
288 0.998595950474 fBodyBodyGyroMag-iqr()
289 0.999404578414 tBodyGyro-std()-Y
290 1.00743628222 tBodyGyro-max()-X
291 1.00883017514 fBodyAcc-kurtosis()-Z
292 1.01349820481 fBodyGyro-iqr()-Z
293 1.02955007896 tBodyGyro-iqr()-Y
294 1.03199153378 tBodyAccJerk-mad()-Z
295 1.0405543979 fBodyAccJerk-mean()-Z
296 1.04219838415 fBodyGyro-mad()-Y
297 1.04887094126 tBodyAccJerk-max()-Y
298 1.0501379703 tBodyAccJerk-std()-Z
299 1.05567901482 tGravityAcc-entropy()-Z
300 1.06901621026 fBodyAccJerk-mad()-Z
301 1.07197804091 tBodyAccJerk-max()-X
302 1.07218104917 fBodyGyro-mean()-Y
303 1.07518663196 fBodyGyro-std()-Y
304 1.07665757147 tBodyAccJerk-iqr()-Z
305 1.07931144188 tBodyGyroJerk-max()-X
306 1.07952031179 fBodyGyro-kurtosis()-X
307 1.08662739876 fBodyBodyGyroMag-max()
308 1.08821253386 tGravityAccMag-min()
309 1.08821253386 tBodyAccMag-min()
310 1.08825818362 fBodyAcc-iqr()-Z
311 1.09433852001 tBodyGyro-iqr()-Z
312 1.09835986696 fBodyAccJerk-skewness()-Z
313 1.10115843382 fBodyGyro-max()-X
314 1.10329333471 tBodyAccJerk-min()-Z
315 1.10860332085 fBodyAccJerk-std()-Z
316 1.10942102955 fBodyAcc-skewness()-Y
317 1.13340084372 tBodyAcc-min()-Z
318 1.14593048845 fBodyGyro-kurtosis()-Z
319 1.15873029322 tGravityAcc-max()-Y
320 1.16513795296 fBodyAccJerk-iqr()-Z
321 1.16535896438 fBodyAccMag-skewness()
322 1.16693831205 tGravityAcc-mean()-Y
323 1.17353661325 tGravityAcc-min()-Y
324 1.18895812907 tBodyGyro-min()-X
325 1.18918166351 tBodyGyroMag-min()
326 1.19444176141 tBodyGyroJerk-min()-X
327 1.20000369521 tGravityAccMag-energy()
328 1.20000369521 tBodyAccMag-energy()
329 1.21398463721 tBodyAccJerkMag-min()
330 1.2170597452 fBodyAccMag-maxInds
331 1.23073733675 tBodyGyroJerkMag-iqr()
332 1.23968057574 tBodyGyroJerk-max()-Z
333 1.26181895937 fBodyAcc-energy()-Y
334 1.27634939598 fBodyAcc-bandsEnergy()-1,24.1
335 1.29213257034 fBodyBodyAccJerkMag-kurtosis()
336 1.29581582098 fBodyGyro-kurtosis()-Y
337 1.31261622275 fBodyAcc-bandsEnergy()-1,16.1
338 1.34253009705 tBodyGyro-min()-Y
339 1.35945450834 tBodyGyroJerkMag-mad()
340 1.36135059225 fBodyGyro-max()-Z
341 1.3755465212 tBodyGyroJerkMag-min()
342 1.39776607041 fBodyBodyGyroJerkMag-mean()
343 1.39776607041 fBodyBodyGyroJerkMag-sma()
344 1.3977815421 fBodyAcc-bandsEnergy()-1,8.1
345 1.39799851089 fBodyAccJerk-max()-Z
346 1.42293058033 fBodyBodyGyroMag-kurtosis()
347* 1.42296186491 angle(X,gravityMean)
348 1.4254605819 angle(Y,gravityMean)
349* 1.42884856959 tGravityAcc-energy()-X
350 1.43097916293 fBodyGyro-iqr()-Y
351 1.46387104576 tBodyGyroJerkMag-std()
352 1.46409820924 tBodyGyroJerk-iqr()-Y
353 1.46430027007 tBodyGyro-max()-Y
354 1.49168954209 fBodyBodyGyroJerkMag-iqr()
355 1.49905735202 tBodyGyroJerk-min()-Z
356 1.5159992198 tBodyAccJerk-max()-Z
357 1.52215074375 tBodyGyroJerk-mad()-Y
358 1.53079497022 fBodyBodyGyroJerkMag-kurtosis()
359 1.56789558629 fBodyGyro-max()-Y
360 1.56850976571 fBodyBodyGyroJerkMag-mad()
361 1.59115715938 tBodyGyroJerkMag-max()
362 1.60008238638 tBodyGyroJerk-std()-Y
363 1.61939135517 fBodyBodyGyroJerkMag-std()
364 1.62234947919 fBodyAcc-kurtosis()-Y
365* 1.62648611505 tGravityAcc-min()-X
366 1.6291553282 tBodyAccJerk-energy()-Y
367* 1.62924391195 tGravityAcc-mean()-X
368 1.62957406303 fBodyAccJerk-energy()-Y
369* 1.64228675532 tGravityAcc-max()-X
370 1.65279262445 fBodyBodyGyroJerkMag-max()
371 1.65833564923 fBodyAccMag-energy()
372 1.66158188414 tBodyAccJerkMag-energy()
373 1.69826265506 fBodyAccJerk-bandsEnergy()-1,24.1
374 1.7089686733 tBodyGyroMag-energy()
375 1.73477173926 fBodyAcc-maxInds-Y
376 1.7348920256 tBodyAcc-energy()-X
377 1.7366745789 fBodyAcc-energy()-X
378 1.74743273066 fBodyAcc-bandsEnergy()-1,24
379 1.75939151981 tBodyAcc-mean()-Z
380 1.76991705001 fBodyBodyAccJerkMag-energy()
381 1.77582077434 fBodyAccJerk-bandsEnergy()-1,24
382 1.79493081519 tBodyAccJerk-energy()-X
383 1.79530053652 fBodyAccJerk-energy()-X
384 1.80650371815 fBodyGyro-maxInds-Z
385 1.80925018427 fBodyAcc-bandsEnergy()-1,16
386 1.8433187822 fBodyBodyAccJerkMag-min()
387 1.89850253278 fBodyAccMag-kurtosis()
388 1.90569071283 fBodyAcc-bandsEnergy()-1,8
389 1.90848665811 fBodyAccJerk-bandsEnergy()-1,8.1
390 1.91887830887 tBodyAcc-energy()-Y
391 1.93151286553 tBodyGyroJerk-max()-Y
392 1.94933817053 fBodyAcc-maxInds-X
393 1.95792153439 fBodyAccJerk-kurtosis()-X
394 1.97948618158 tBodyGyroJerk-min()-Y
395 1.9882925733 fBodyAcc-bandsEnergy()-17,32.1
396 1.99964199588 fBodyAccJerk-bandsEnergy()-1,16.1
397 2.01684782491 fBodyAccJerk-bandsEnergy()-1,16
398 2.03407449851 fBodyAccJerk-bandsEnergy()-17,32.1
399 2.045096371 fBodyAccJerk-bandsEnergy()-17,24.1
400 2.07009971894 fBodyAcc-bandsEnergy()-17,24.1
401 2.07793048557 fBodyGyro-maxInds-Y
402 2.10651790552 fBodyAcc-energy()-Z
403 2.13276463034 fBodyAcc-bandsEnergy()-17,32
404 2.14868233163 fBodyAcc-bandsEnergy()-1,24.2
405 2.18854227131 fBodyAccJerk-bandsEnergy()-17,32
406 2.22358605035 tBodyAcc-energy()-Z
407 2.24474898421 fBodyAcc-bandsEnergy()-9,16.1
408 2.24701526088 tGravityAcc-energy()-Y
409 2.24978653007 fBodyAccJerk-bandsEnergy()-9,16
410 2.26633907716 fBodyAcc-bandsEnergy()-17,24
411 2.31396289398 fBodyAccJerk-bandsEnergy()-17,24
412 2.336852093 fBodyAcc-bandsEnergy()-9,16
413 2.34514812817 fBodyAcc-bandsEnergy()-1,16.2
414 2.3455219318 fBodyAccJerk-bandsEnergy()-9,16.1
415 2.35645334898 fBodyBodyGyroJerkMag-min()
416 2.3568462525 tBodyGyro-energy()-X
417 2.36126874056 tGravityAcc-energy()-Z
418 2.37346365479 fBodyAccJerk-min()-Y
419 2.40285801332 fBodyBodyGyroMag-maxInds
420 2.42840475392 fBodyAccJerk-bandsEnergy()-25,48
421 2.44758886656 fBodyAcc-bandsEnergy()-25,48
422 2.45866310095 fBodyAccJerk-bandsEnergy()-1,8
423 2.49067063891 fBodyAccJerk-min()-X
424 2.50462343685 fBodyAcc-min()-X
425 2.54422208703 fBodyAccJerk-bandsEnergy()-33,48
426 2.59011669381 fBodyBodyGyroMag-energy()
427 2.59256320402 tGravityAcc-entropy()-Y
428 2.60946390506 fBodyAccJerk-bandsEnergy()-1,16.2
429 2.61566522733 fBodyAccJerk-bandsEnergy()-25,48.1
430 2.61589403969 fBodyAcc-bandsEnergy()-1,8.2
431 2.62077556636 fBodyAcc-maxInds-Z
432 2.62423588132 fBodyAccJerk-bandsEnergy()-41,48.1
433 2.63053344183 fBodyAcc-bandsEnergy()-25,48.1
434 2.6335532331 fBodyAcc-bandsEnergy()-33,48
435 2.65260703339 fBodyAcc-bandsEnergy()-41,48.1
436 2.66225442743 fBodyAccJerk-bandsEnergy()-33,48.1
437 2.67372731772 fBodyAcc-bandsEnergy()-33,48.1
438 2.71054665393 fBodyGyro-maxInds-X
439 2.7385321743 fBodyAcc-min()-Y
440 2.74009905589 fBodyAcc-bandsEnergy()-33,40
441 2.75066169455 fBodyAccJerk-min()-Z
442 2.76255744324 tBodyGyro-energy()-Y
443 2.7706487508 fBodyAcc-bandsEnergy()-25,32
444 2.78257316546 fBodyAccJerk-bandsEnergy()-1,24.2
445 2.79003247183 fBodyGyro-bandsEnergy()-1,24.1
446 2.79823345941 fBodyGyro-energy()-Y
447 2.80447415043 fBodyGyro-energy()-X
448 2.81588617031 fBodyAccMag-min()
449 2.82740155386 fBodyGyro-bandsEnergy()-1,24
450 2.83090612019 fBodyAccJerk-bandsEnergy()-33,40
451 2.83905120874 fBodyGyro-energy()-Z
452 2.8708380889 fBodyAccJerk-bandsEnergy()-41,48
453 2.8716980296 fBodyAccJerk-bandsEnergy()-25,32
454 2.88808169999 fBodyBodyGyroMag-min()
455 2.8924360657 fBodyAccJerk-kurtosis()-Z
456 2.89967914491 fBodyGyro-bandsEnergy()-1,16
457 2.91764446591 fBodyGyro-bandsEnergy()-1,24.2
458 2.92033989182 fBodyAccJerk-bandsEnergy()-1,8.2
459 2.9410860016 fBodyAcc-bandsEnergy()-9,16.2
460 2.95395782692 fBodyGyro-bandsEnergy()-1,16.1
461 2.95716083346 tBodyGyro-energy()-Z
462 2.98624322256 fBodyGyro-bandsEnergy()-9,16
463 2.99641078034 fBodyAccJerk-bandsEnergy()-9,16.2
464 3.00879643509 fBodyAccJerk-bandsEnergy()-49,56.1
465 3.0202157803 fBodyAcc-bandsEnergy()-25,32.1
466 3.02815577794 fBodyGyro-bandsEnergy()-17,24
467 3.05103206797 tBodyGyroJerk-energy()-X
468 3.06142206628 fBodyAccJerk-bandsEnergy()-25,32.1
469 3.06959234778 fBodyGyro-min()-Y
470 3.07861497393 fBodyAccJerk-bandsEnergy()-49,64.1
471 3.08181367606 fBodyAcc-min()-Z
472 3.09306441279 fBodyAccJerk-kurtosis()-Y
473 3.13165825465 fBodyGyro-min()-Z
474 3.13299368395 fBodyAcc-bandsEnergy()-41,48
475 3.14492920624 fBodyAcc-bandsEnergy()-33,40.1
476 3.15695398264 fBodyGyro-bandsEnergy()-1,8.1
477 3.15860865885 fBodyGyro-bandsEnergy()-1,16.2
478 3.16108569892 fBodyGyro-bandsEnergy()-17,32
479 3.24112637398 tBodyGyroJerk-energy()-Z
480 3.2601938792 fBodyGyro-bandsEnergy()-1,8
481 3.3003295991 tBodyAccJerk-energy()-Z
482 3.30127703451 fBodyAccJerk-energy()-Z
483 3.31674074595 fBodyAcc-bandsEnergy()-49,56.1
484 3.32832743672 fBodyAccJerk-bandsEnergy()-33,40.1
485 3.48989403158 tBodyAcc-mean()-X
486 3.66852703812 fBodyGyro-bandsEnergy()-1,8.2
487 3.75557639651 fBodyGyro-min()-X
488 3.80409109096 fBodyAccJerk-bandsEnergy()-17,24.2
489 3.8231253941 tBodyGyroJerkMag-energy()
490 3.91929238835 fBodyAcc-bandsEnergy()-17,32.2
491 3.92920919826 fBodyAccJerk-bandsEnergy()-49,56
492 3.93463660304 fBodyAccJerk-bandsEnergy()-49,64
493 3.93546285531 fBodyAcc-bandsEnergy()-17,24.2
494 3.99177784967 fBodyAccJerk-bandsEnergy()-49,64.2
495 3.99292463278 fBodyAccJerk-bandsEnergy()-49,56.2
496 4.04507460345 fBodyGyro-bandsEnergy()-17,32.2
497 4.06162215152 fBodyAcc-bandsEnergy()-49,64.1
498 4.09906618365 fBodyAcc-bandsEnergy()-49,56
499 4.1858605803 fBodyAccJerk-bandsEnergy()-17,32.2
500 4.2297130126 fBodyAcc-bandsEnergy()-49,56.2
501 4.25988077353 fBodyGyro-bandsEnergy()-17,24.2
502 4.44474574359 fBodyAcc-bandsEnergy()-49,64
503 4.51718190618 fBodyAcc-bandsEnergy()-41,48.2
504 4.59780071293 fBodyAccJerk-bandsEnergy()-41,48.2
505 4.62452087934 fBodyAcc-bandsEnergy()-49,64.2
506 4.676335405 fBodyBodyGyroJerkMag-energy()
507 4.75113147262 fBodyGyro-bandsEnergy()-9,16.2
508 4.889908241 tBodyGyroJerk-energy()-Y
509 4.98416905982 fBodyAcc-bandsEnergy()-57,64
510 5.01469251886 fBodyBodyAccJerkMag-maxInds
511 5.19471100405 fBodyGyro-bandsEnergy()-41,48
512 5.20335604517 fBodyGyro-bandsEnergy()-33,48
513 5.28556934279 fBodyBodyGyroJerkMag-maxInds
514 5.39734500466 fBodyGyro-bandsEnergy()-49,56.2
515 5.57328329589 fBodyGyro-bandsEnergy()-25,48
516 5.70075950591 fBodyAcc-bandsEnergy()-57,64.1
517 5.79595156173 fBodyAcc-bandsEnergy()-25,48.2
518 5.87487460185 fBodyAcc-bandsEnergy()-33,48.2
519 5.91588430745 fBodyAcc-bandsEnergy()-25,32.2
520 5.93584433775 fBodyAcc-bandsEnergy()-57,64.2
521 5.96009248698 fBodyGyro-bandsEnergy()-33,40
522 5.96164826714 fBodyGyro-bandsEnergy()-25,32
523 6.02229862855 fBodyGyro-bandsEnergy()-49,56.1
524 6.02475857025 fBodyAccJerk-bandsEnergy()-25,48.2
525 6.08827213752 fBodyAccJerk-bandsEnergy()-25,32.2
526 6.11414792659 fBodyAccJerk-bandsEnergy()-33,48.2
527 6.18683017123 fBodyGyro-bandsEnergy()-49,64.1
528 6.29905585908 fBodyGyro-bandsEnergy()-41,48.2
529 6.33657460356 fBodyGyro-bandsEnergy()-17,32.1
530 6.34333429702 fBodyGyro-bandsEnergy()-9,16.1
531 6.3519826935 fBodyGyro-bandsEnergy()-49,64.2
532 6.35607108151 fBodyGyro-bandsEnergy()-41,48.1
533 6.79014465903 fBodyGyro-bandsEnergy()-49,56
534 6.92781847882 tGravityAcc-std()-Z
535 7.01664123817 tGravityAcc-mad()-Z
536 7.02811277083 fBodyGyro-bandsEnergy()-25,32.2
537 7.25359421168 fBodyGyro-bandsEnergy()-17,24.1
538 7.2757680758 tGravityAcc-iqr()-Z
539 7.30307814418 fBodyAcc-bandsEnergy()-33,40.2
540 7.37033016756 fBodyGyro-bandsEnergy()-49,64
541 7.43064576666 fBodyGyro-bandsEnergy()-25,48.2
542 7.49245465563 fBodyGyro-bandsEnergy()-25,32.1
543 7.8117806773 fBodyAccJerk-bandsEnergy()-33,40.2
544 7.81368318394 fBodyGyro-bandsEnergy()-57,64.2
545 7.89396800077 fBodyGyro-bandsEnergy()-25,48.1
546 8.0421632488 fBodyGyro-bandsEnergy()-57,64
547 8.15162096692 fBodyAccJerk-bandsEnergy()-57,64.2
548 8.21504807324 fBodyGyro-bandsEnergy()-33,48.2
549 8.23969565046 fBodyGyro-bandsEnergy()-57,64.1
550 8.51410725858 fBodyAccJerk-bandsEnergy()-57,64.1
551 8.78163491799 tGravityAcc-std()-Y
552 8.9289275525 tGravityAcc-mad()-Y
553 9.15793962521 fBodyGyro-bandsEnergy()-33,40.2
554 9.46984761412 tGravityAcc-iqr()-Y
555 10.2724333963 fBodyGyro-bandsEnergy()-33,48.1
556 11.1342039411 tGravityAcc-std()-X
557 11.4406066343 tGravityAcc-mad()-X
558 12.3405253118 tGravityAcc-iqr()-X
559 12.3582190015 fBodyGyro-bandsEnergy()-33,40.1
560 14.0274421291 fBodyAccJerk-bandsEnergy()-57,64

In addition to the KBest feature selection we can also use PCA to reduce the feature set size. First let's have a closer look at the two major components to see the natural distinction of the feature space defined by the PCA components.

In [87]:
# Produce a scatter matrix for each pair of features in the data
axes = pd.scatter_matrix(Xs, alpha = 0.3, figsize = (20,32), diagonal = 'kde')

# Reformat data.corr() for plotting
corr = Xs.corr().as_matrix()

# Plot scatter matrix with correlations
for i,j in zip(*np.triu_indices_from(axes, k=1)):
    axes[i,j].annotate("%.2f"%corr[i,j], (0.1,0.25), xycoords='axes fraction', color='red', fontsize=16)

Skewness greater than zero shows a positively skewed distribution, while lower than zero shows a negatively skewed distribution. Replacing the data with the log, square root, or inverse may help to remove the skew. However, feature values of the current dataset with selected features change between -1 and 1. Therefore, sqrt and log is not applicable. If we apply any of those transformations, most of the feature values will turn into NaN and dataset will be useless.

To avoid this we first shift data to a non-negative range, then apply the non-linear transformation, after that scale it back to -1 and 1 to be able to compare the change in the feature distribution with bare eyes. If all go right, we should be able to see less skewed feature distribution.

In addition to sqrt-ing and log-ing, I will also try boxcox-ing to reduce the skewness.

Discussion 2

SVM's classification performance for 16, 56, and 561 features are as follows:
n_features: 16 t_train: 0.802sec t_pred: 0.620sec precision: 0.84 recall: 0.81 fscore: 0.79
n_features: 56 t_train: 1.421sec t_pred: 1.247sec precision: 0.88 recall: 0.83 fscore: 0.81
n_features: 561 t_train: 8.916sec t_pred: 7.667sec precision: 0.94 recall: 0.94 fscore: 0.94

Using 16-feature reduced the training and testing time by more than 10 times while losing 10% classification performance measured as precision, recall and fscore. When compared to choosing 56 best features (top 10% of the whole feature vector), we see that 16-feature is almost as good as 56-feature in classification performance. 16-feature is twice faster than 56-feature in training and testing times.

As 16-feature is good enough for SVM, now I will find ways to improve the classification performance by scaling, normalizing and outlier-removal. First, let's have a look at how the features are distributed by using a correlation matrix.

In [88]:
import scipy.stats.stats as st
skness = st.skew(Xs)

for skew, feature_name in zip(skness , Xs_cols.tolist()):
    print "skewness: {:+.2f}\t\t feature: ".format(skew) + feature_name
skewness: +0.64		 feature: tBodyAcc-std()-X
skewness: +0.60		 feature: tBodyAcc-max()-X
skewness: -1.63		 feature: tGravityAcc-mean()-X
skewness: -1.64		 feature: tGravityAcc-max()-X
skewness: -1.63		 feature: tGravityAcc-min()-X
skewness: -1.43		 feature: tGravityAcc-energy()-X
skewness: +0.11		 feature: tBodyAccJerk-entropy()-X
skewness: +0.07		 feature: tBodyAccJerk-entropy()-Y
skewness: +0.17		 feature: tBodyAccJerk-entropy()-Z
skewness: +0.08		 feature: tBodyAccJerkMag-entropy()
skewness: +0.13		 feature: fBodyAcc-entropy()-X
skewness: +0.20		 feature: fBodyAccJerk-entropy()-X
skewness: +0.19		 feature: fBodyAccJerk-entropy()-Y
skewness: +0.27		 feature: fBodyAccJerk-entropy()-Z
skewness: +0.23		 feature: fBodyBodyAccJerkMag-entropy()
skewness: +1.42		 feature: angle(X,gravityMean)
In [8]:
from sklearn import preprocessing
from scipy.stats import boxcox

plt.rcParams['figure.figsize'] = (20.0, 80.0)
f, axarr = plt.subplots(len(Xs_cols.tolist()), 4, sharey=True)

preprocessing_names = ["noproc", "sqrted", "logged", "bxcxed"]

cnt = 0
for feature in Xs_cols.tolist():
    
    for i in range(4):
#         axarr[cnt, i].set_title( "[" + preprocessing_names[i] + "] "+ feature)
        axarr[cnt, i].set_title(feature + " histogram")
        axarr[cnt, i].set_xlabel(feature)
        axarr[cnt, i].set_ylabel("number of data points")
    
    Xs_feature = Xs[feature]
    skness = st.skew(Xs_feature)
    axarr[cnt, 0].hist(Xs_feature,facecolor='blue',alpha=0.75)
    axarr[cnt, 0].text(0.05, 0.95, 'Skewness[noproc]: {:.2f}'.format(skness), transform=axarr[cnt, 0].transAxes, 
                       fontsize=12, verticalalignment='top', color='red')
    
    Xs_feature_scaled = preprocessing.MinMaxScaler(feature_range=(1, 2), copy=True).fit_transform(Xs_feature)

    Xs_feature_sqrted = preprocessing.MinMaxScaler(feature_range=(-1, 1), copy=True).fit_transform(np.sqrt(Xs_feature_scaled))
#     Xs_feature_sqrted = preprocessing.scale(np.sqrt(Xs_feature_scaled))
    skness = st.skew(Xs_feature_sqrted)
    axarr[cnt, 1].hist(Xs_feature_sqrted,facecolor='blue',alpha=0.75)
    axarr[cnt, 1].text(0.05, 0.95, 'Skewness[sqrted]: {:.2f}'.format(skness), transform=axarr[cnt, 1].transAxes, 
                       fontsize=12, verticalalignment='top', color='green')
    
    Xs_feature_logged = preprocessing.MinMaxScaler(feature_range=(-1, 1), copy=True).fit_transform(np.log(Xs_feature_scaled))
#     Xs_feature_logged = preprocessing.scale(np.log(Xs_feature_scaled))
    skness = st.skew(Xs_feature_logged)
    axarr[cnt, 2].hist(Xs_feature_logged,facecolor='blue',alpha=0.75)
    axarr[cnt, 2].text(0.05, 0.95, 'Skewness[logged]: {:.2f}'.format(skness), transform=axarr[cnt, 2].transAxes, 
                       fontsize=12, verticalalignment='top', color='green')
    
    Xs_feature_bxcxed = preprocessing.MinMaxScaler(feature_range=(-1, 1), copy=True).fit_transform(boxcox(Xs_feature_scaled)[0])
#     Xs_feature_bxcxed = preprocessing.scale(boxcox(Xs_feature_scaled)[0])
    skness = st.skew(Xs_feature_bxcxed)
    axarr[cnt, 3].hist(Xs_feature_bxcxed,facecolor='blue',alpha=0.75)
    axarr[cnt, 3].text(0.05, 0.95, 'Skewness[bxcxed]: {:.2f}'.format(skness), transform=axarr[cnt, 3].transAxes, fontsize=12, 
                       verticalalignment='top', color='green',  bbox=dict(facecolor='white', alpha=0.5, boxstyle='square'))    
    cnt += 1

plt.show()

Tried robust scaler but it didn't have any effect on the dataset's skewness

In [9]:
Xs_rscaled = preprocessing.RobustScaler().fit_transform(Xs)
print Xs_rscaled.shape

for feature in range(Xs_rscaled.shape[1]):
    Xs_rscaled_feature = Xs_rscaled[:,feature]
    skness = st.skew(Xs_rscaled_feature)
    print "{:2d}".format(feature) + "  {:+.2f}".format(skness)
(10299L, 16L)
 0  +0.64
 1  +0.60
 2  -1.63
 3  -1.64
 4  -1.63
 5  -1.43
 6  +0.11
 7  +0.07
 8  +0.17
 9  +0.08
10  +0.13
11  +0.20
12  +0.19
13  +0.27
14  +0.23
15  +1.42
In [10]:
def boxCoxData(data):
    data_bxcxed = []
    for feature in range(data.shape[1]):
        data_bxcxed_feature, maxlog = boxcox(data[:,feature])
        if feature == 0:
            data_bxcxed = data_bxcxed_feature
        else:
            data_bxcxed = np.column_stack([data_bxcxed, data_bxcxed_feature])
    return data_bxcxed

def ScaleData(data):
    data_scaled = []
    for feature in range(data.shape[1]):
        data_scaled_feature = preprocessing.scale(data[:,feature])
        if feature == 0:
            data_scaled = data_scaled_feature
        else:
            data_scaled = np.column_stack([data_scaled, data_scaled_feature])
    return data_scaled

def testSVMPerformance(data_train, label_train, data_test, label_test, preprocess_method):
    
    if preprocess_method != "":
        data_train = preprocessing.MinMaxScaler(feature_range=(1, 2), copy=True).fit_transform(data_train)
        data_test = preprocessing.MinMaxScaler(feature_range=(1, 2), copy=True).fit_transform(data_test)
    
        if preprocess_method == "logged":
            data_train = np.log(data_train)
            data_test = np.log(data_test)
        elif preprocess_method == "sqrted":
            data_train = np.sqrt(data_train)
            data_test = np.sqrt(data_test)
        elif preprocess_method == "bxcxed":
            data_train = boxCoxData(data_train)
            data_test = boxCoxData(data_test)
            
        #this resulted in a more inferior performance compared to preprocessing.scale method
#         data_train = preprocessing.MinMaxScaler(feature_range=(-1, 1), copy=True).fit_transform(data_train)
#         data_test = preprocessing.MinMaxScaler(feature_range=(-1, 1), copy=True).fit_transform(data_test)

        data_train = ScaleData(data_train)
        data_test = ScaleData(data_test)        
        
    start = time()
    clf_SVM.fit(data_train, label_train)
    end = time()
    t_train = end - start
    #NOTE: For some reason this doesn't work here
#     t_train = train(clf_SVM, data_train, label_train)
    t_test, y_pred = predict(clf_SVM, data_test)
    precision, recall, fscore, support = precision_recall_fscore_support(label_test, y_pred, average='weighted')

    printout = preprocess_method
    if preprocess_method == "":
        printout = "noproc"
    
    printout += "  t_train: {:.3f}sec".format(t_train)
    printout += "  t_pred: {:.3f}sec".format(t_test)
    printout += "  precision: {:.2f}".format(precision)
    printout += "  recall: {:.2f}".format(recall)
    printout += "  fscore: {:.2f}".format(fscore)
    print printout

X_train_processed = X_train[Xs_cols]
X_test_processed = X_test[Xs_cols]

testSVMPerformance(X_train_processed, y_train['Activity'], X_test_processed, y_test['Activity'], "")
testSVMPerformance(X_train_processed, y_train['Activity'], X_test_processed, y_test['Activity'], "scaled")
testSVMPerformance(X_train_processed, y_train['Activity'], X_test_processed, y_test['Activity'], "logged")
testSVMPerformance(X_train_processed, y_train['Activity'], X_test_processed, y_test['Activity'], "sqrted")
testSVMPerformance(X_train_processed, y_train['Activity'], X_test_processed, y_test['Activity'], "bxcxed")
noproc  t_train: 0.808sec  t_pred: 0.621sec  precision: 0.84  recall: 0.81  fscore: 0.79
scaled  t_train: 0.715sec  t_pred: 0.549sec  precision: 0.85  recall: 0.82  fscore: 0.81
logged  t_train: 0.733sec  t_pred: 0.594sec  precision: 0.84  recall: 0.82  fscore: 0.82
sqrted  t_train: 0.713sec  t_pred: 0.555sec  precision: 0.84  recall: 0.82  fscore: 0.82
bxcxed  t_train: 0.649sec  t_pred: 0.538sec  precision: 0.87  recall: 0.87  fscore: 0.87

It is time to test if there is any outlier in the boxcoxed dataset.

In [11]:
Xs_processed = preprocessing.MinMaxScaler(feature_range=(1, 2), copy=True).fit_transform(Xs)
Xs_bxcxed = boxCoxData(Xs_processed)
Xs_bxcxed_scaled = preprocessing.MinMaxScaler(feature_range=(-1, 1), copy=True).fit_transform(Xs_bxcxed)

outliers = []
for feature in range(Xs_bxcxed_scaled.shape[1]):
    Q1 = np.percentile(Xs_bxcxed_scaled[:, feature], 25)
    Q3 = np.percentile(Xs_bxcxed_scaled[:, feature], 75)
    step = 1.5 * (Q3 - Q1)

    outlier_filter = ~((Xs_bxcxed_scaled[:, feature] >= Q1 - step) & (Xs_bxcxed_scaled[:, feature] <= Q3 + step))
    
    cnt = 0
    for outlier in outlier_filter:
        if outlier:
            outliers.append(cnt)
        cnt += 1
    
# print "number of outliers with repeating indices: " + str(len(outliers))

id2cnt = {}
for outlier in outliers:
    if not outlier in id2cnt:
        id2cnt[outlier] = 1
    else:
        id2cnt[outlier] += 1
    
sorted_id2cnt = sorted(id2cnt.items(), key=operator.itemgetter(1), reverse=True)
cnt2nindices = {}
for key, value in sorted_id2cnt:
    #only remove the outliers that are repeated more than once
    if value <=1:
        break
    if not value in cnt2nindices:
        cnt2nindices[value] = 1
    else:
        cnt2nindices[value] += 1

for key, value in cnt2nindices.iteritems():
    print "{:2d} features share {:4d} potential outliers".format(key, value)
 2 features share   23 potential outliers
 3 features share 1953 potential outliers

Let's try to remove those 1953 potential outliers and test the performance of the SVM again. Although, this seems like losing too much data, I just want to see how this may effect the learning performance.

In [12]:
removed_outliers = []
for key, value in sorted_id2cnt:
    if value == 3:
        removed_outliers.append(key)

y_labels = y['Activity']

results_precision = []
results_recall = []
results_fscore = []
kfold = cross_validation.KFold(Xs.shape[0], n_folds=10, shuffle=False, random_state=42)
for train, test in kfold:
    clf_SVM.fit(Xs.iloc[train], y_labels.iloc[train])
#         t_train = train(clf_SVM, Xs_subset.iloc[train], y_subset.iloc[train])
    t_test, y_pred = predict(clf_SVM, Xs.iloc[test])
    precision, recall, fscore, support = precision_recall_fscore_support(y_labels.iloc[test], y_pred, 
                                                                         average='weighted')
    results_precision.append(precision)
    results_recall.append(recall)
    results_fscore.append(fscore)

printout = "subsetsize: {:5d}".format(Xs.shape[0])
#     printout += "  t_train: {:.3f}sec".format(t_train)
#     printout += "  t_pred: {:.3f}sec".format(t_test)
printout += "  precision: {:.2f}".format(np.mean(results_precision))
printout += "  recall: {:.2f}".format(np.mean(results_recall))
printout += "  fscore: {:.2f}".format(np.mean(results_fscore))
print printout

Xs_filtered = Xs.drop(removed_outliers)
y_filtered = y.drop(removed_outliers)
y_filtered_labels = y_filtered['Activity'].to_frame()

Xs_filtered_proc = preprocessing.MinMaxScaler(feature_range=(1, 2), copy=True).fit_transform(Xs_filtered)
Xs_filtered_proc = boxCoxData(Xs_filtered_proc)
Xs_filtered_proc = ScaleData(Xs_filtered_proc)
Xs_filtered_proc = Xs_filtered_proc

results_precision = []
results_recall = []
results_fscore = []
kfold = cross_validation.KFold(Xs_filtered_proc.shape[0], n_folds=10, shuffle=False, random_state=42)  
for train, test in kfold:
    clf_SVM.fit(Xs_filtered_proc[train], y_filtered_labels.iloc[train])
    t_test, y_pred = predict(clf_SVM, Xs_filtered_proc[test])
    precision, recall, fscore, support = precision_recall_fscore_support(y_filtered_labels.iloc[test], y_pred, 
                                                                         average='weighted')
    results_precision.append(precision)
    results_recall.append(recall)
    results_fscore.append(fscore)

print "**************"
printout = "subsetsize: {:5d}".format(Xs_filtered_proc.shape[0])
#     printout += "  t_train: {:.3f}sec".format(t_train)
#     printout += "  t_pred: {:.3f}sec".format(t_test)
printout += "  precision: {:.2f}".format(np.mean(results_precision))
printout += "  recall: {:.2f}".format(np.mean(results_recall))
printout += "  fscore: {:.2f}".format(np.mean(results_fscore))
print printout


results_precision = []
results_recall = []
results_fscore = []
kfold = cross_validation.KFold(Xs_filtered.shape[0], n_folds=10, shuffle=False, random_state=42)  
for train, test in kfold:
    clf_SVM.fit(Xs_filtered.iloc[train], y_filtered_labels.iloc[train])
#         t_train = train(clf_SVM, Xs_subset.iloc[train], y_subset.iloc[train])
    t_test, y_pred = predict(clf_SVM, Xs_filtered.iloc[test])
    precision, recall, fscore, support = precision_recall_fscore_support(y_filtered_labels.iloc[test], y_pred, 
                                                                         average='weighted')
    results_precision.append(precision)
    results_recall.append(recall)
    results_fscore.append(fscore)

printout = "subsetsize: {:5d}".format(Xs_filtered.shape[0])
#     printout += "  t_train: {:.3f}sec".format(t_train)
#     printout += "  t_pred: {:.3f}sec".format(t_test)
printout += "  precision: {:.2f}".format(np.mean(results_precision))
printout += "  recall: {:.2f}".format(np.mean(results_recall))
printout += "  fscore: {:.2f}".format(np.mean(results_fscore))
print printout

from random import sample
n_multiplier = Xs.shape[0]/500

for i in range(1, n_multiplier+1):
    subsetsize = i*500
    random_index = sample(range(0, Xs.shape[0]), subsetsize)
    
    Xs_subset = Xs.iloc[random_index]
    y_subset = y_labels.iloc[random_index].to_frame()

    results_precision = []
    results_recall = []
    results_fscore = []
    kfold = cross_validation.KFold(Xs_subset.shape[0], n_folds=10, shuffle=False, random_state=42)
    for train, test in kfold:
        clf_SVM.fit(Xs_subset.iloc[train], y_subset.iloc[train])
#         t_train = train(clf_SVM, Xs_subset.iloc[train], y_subset.iloc[train])
        t_test, y_pred = predict(clf_SVM, Xs_subset.iloc[test])
        precision, recall, fscore, support = precision_recall_fscore_support(y_subset.iloc[test], y_pred, 
                                                                             average='weighted')
        results_precision.append(precision)
        results_recall.append(recall)
        results_fscore.append(fscore)

    printout = "subsetsize: {:5d}".format(subsetsize)
#     printout += "  t_train: {:.3f}sec".format(t_train)
#     printout += "  t_pred: {:.3f}sec".format(t_test)
    printout += "  precision: {:.2f}".format(np.mean(results_precision))
    printout += "  recall: {:.2f}".format(np.mean(results_recall))
    printout += "  fscore: {:.2f}".format(np.mean(results_fscore))
    print printout
subsetsize: 10299  precision: 0.85  recall: 0.82  fscore: 0.81
**************
subsetsize:  8346  precision: 0.88  recall: 0.87  fscore: 0.87
subsetsize:  8346  precision: 0.82  recall: 0.78  fscore: 0.76
subsetsize:   500  precision: 0.81  recall: 0.77  fscore: 0.74
subsetsize:  1000  precision: 0.85  recall: 0.79  fscore: 0.76
subsetsize:  1500  precision: 0.82  recall: 0.80  fscore: 0.79
subsetsize:  2000  precision: 0.81  recall: 0.79  fscore: 0.79
subsetsize:  2500  precision: 0.85  recall: 0.80  fscore: 0.77
subsetsize:  3000  precision: 0.84  recall: 0.81  fscore: 0.80
subsetsize:  3500  precision: 0.86  recall: 0.82  fscore: 0.79
subsetsize:  4000  precision: 0.85  recall: 0.82  fscore: 0.81
subsetsize:  4500  precision: 0.85  recall: 0.82  fscore: 0.81
subsetsize:  5000  precision: 0.85  recall: 0.83  fscore: 0.81
subsetsize:  5500  precision: 0.85  recall: 0.82  fscore: 0.80
subsetsize:  6000  precision: 0.86  recall: 0.83  fscore: 0.81
subsetsize:  6500  precision: 0.85  recall: 0.82  fscore: 0.81
subsetsize:  7000  precision: 0.86  recall: 0.82  fscore: 0.81
subsetsize:  7500  precision: 0.85  recall: 0.83  fscore: 0.81
subsetsize:  8000  precision: 0.86  recall: 0.82  fscore: 0.81
subsetsize:  8500  precision: 0.85  recall: 0.83  fscore: 0.82
subsetsize:  9000  precision: 0.86  recall: 0.83  fscore: 0.82
subsetsize:  9500  precision: 0.86  recall: 0.83  fscore: 0.81
subsetsize: 10000  precision: 0.86  recall: 0.83  fscore: 0.82

This shows that feature preprocessing and outlier removal are tied together. In other words, detected outliers are special to the space they are transformed to through preprocessing methods. Therefore, outliers in transformed space may not be outliers in the original space. Following results show that removing the outliers is only good if the learning is done on the space where those features transformed to.

subsetsize: 8346 precision: 0.87 recall: 0.86 fscore: 0.86 (features are preprocessed)
subsetsize: 8346 precision: 0.80 recall: 0.76 fscore: 0.74 (features are kept as the way they are)

In [ ]:
from sklearn.decomposition import PCA

num_components = range(2, X.shape[1]/10, 2) + range(X.shape[1]/10, X.shape[1]/3, X.shape[1]/40)

d_ncpa_to_precision = {}
d_npca_to_recall = {}
d_npca_to_f1score = {}

for n_components in num_components:
    pca = PCA(n_components=n_components).fit(X)
#     print pca.explained_variance_ratio_
    printout = "n_components: {:d}".format(n_components)
    X_pcaed = pca.transform(X)
    
    results_precision = []
    results_recall = []
    results_fscore = []    
    kfold = cross_validation.KFold(X_pcaed.shape[0], n_folds=10, shuffle=False, random_state=42)
    for train, test in kfold:
        clf_SVM.fit(X_pcaed[train], y_[train])
        t_test, y_pred = predict(clf_SVM, X_pcaed[test])
        precision, recall, fscore, support = precision_recall_fscore_support(y_[test], y_pred, 
                                                                             average='weighted')
        results_precision.append(precision)
        results_recall.append(recall)
        results_fscore.append(fscore)

    printout += "  precision: {:.2f}".format(np.mean(results_precision))
    printout += "  recall: {:.2f}".format(np.mean(results_recall))
    printout += "  fscore: {:.2f}".format(np.mean(results_fscore))
    print printout

    d_ncpa_to_precision[n_components]=np.mean(results_precision)
    d_npca_to_recall[n_components]=np.mean(results_recall)
    d_npca_to_f1score[n_components]=np.mean(results_fscore)    
In [48]:
plt.rcParams['figure.figsize'] = (20.0, 16.0)
plt.grid(True)
major_ticks = np.arange(0, X.shape[1]/10, 20) 
minor_ticks = np.arange(0, X.shape[1]/10, 5)

# ax.set_xticks(major_ticks)                                                       
# ax.set_xticks(minor_ticks, minor=True) 
plt.xticks(minor_ticks)
plt.plot(d_ncpa_to_precision.keys(), d_ncpa_to_precision.values(), 'r',
        d_npca_to_recall.keys(), d_npca_to_recall.values(), 'g',
        d_npca_to_f1score.keys(), d_npca_to_f1score.values(), 'b')
plt.show()
In [75]:
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import pandas as pd
import numpy as np

def pca_results(good_data, pca):
    '''
    Create a DataFrame of the PCA results
    Includes dimension feature weights and explained variance
    Visualizes the PCA results
    '''

    # Dimension indexing
    dimensions = dimensions = ['Dimension {}'.format(i) for i in range(1,len(pca.components_)+1)]

    # PCA components
    components = pd.DataFrame(np.round(pca.components_, 4), columns = good_data.keys())
    components.index = dimensions

    # PCA explained variance
    ratios = pca.explained_variance_ratio_.reshape(len(pca.components_), 1)
    variance_ratios = pd.DataFrame(np.round(ratios, 4), columns = ['Explained Variance'])
    variance_ratios.index = dimensions

    # Create a bar plot visualization
    fig, ax = plt.subplots(figsize = (18,8))

    # Plot the feature weights as a function of the components
    components.plot(ax = ax, kind = 'bar');
    ax.set_ylabel("Feature Weights")
    ax.set_xticklabels(dimensions, rotation=0)

    # Display the explained variance ratios
    for i, ev in enumerate(pca.explained_variance_ratio_):
        ax.text(i-0.40, ax.get_ylim()[1] + 0.05, "Explained Variance\n          %.4f"%(ev))

    # Return a concatenated DataFrame
    return pd.concat([variance_ratios, components], axis = 1)
In [76]:
from sklearn.decomposition import PCA

n_components = 2
pca = PCA(n_components=n_components).fit(Xs)
# print pca.components_
# print pca.explained_variance_ratio_
# print pca_results['Explained Variance'].cumsum()
pca_results = pca_results(Xs, pca)

# # TODO: Transform the good data using the PCA fit above
# Xs_pcaed = pca.transform(Xs)
# print Xs.shape
# print Xs_pcaed.shape

# # Create a DataFrame for the reduced data
# Xs_pcaed = pd.DataFrame(Xs_pcaed, columns = ['Dimension 1', 'Dimension 2'])
# print Xs_pcaed.shape

# # Produce a scatter matrix for pca reduced data
# pd.scatter_matrix(Xs_pcaed, alpha = 0.8, figsize = (8,4), diagonal = 'kde');

# components = pd.DataFrame(np.round(pca.components_, 4), columns = Xs.keys())
# print components
In [92]:
from sklearn.feature_selection import SelectKBest
from scipy.stats import boxcox
from sklearn import preprocessing
from sklearn.decomposition import PCA

import warnings
warnings.filterwarnings('ignore')

def boxCoxData(data):
    data_bxcxed = []
    for feature in range(data.shape[1]):
        data_bxcxed_feature, maxlog = boxcox(data[:,feature])
        if feature == 0:
            data_bxcxed = data_bxcxed_feature
        else:
            data_bxcxed = np.column_stack([data_bxcxed, data_bxcxed_feature])
    return data_bxcxed

def ScaleData(data):
    data_scaled = []
    for feature in range(data.shape[1]):
        data_scaled_feature = preprocessing.scale(data[:,feature])
        if feature == 0:
            data_scaled = data_scaled_feature
        else:
            data_scaled = np.column_stack([data_scaled, data_scaled_feature])
    return data_scaled

def predict(clf, features):
    start = time()
    pred = clf.predict(features)
    end = time()
    return end - start, pred

# kbest_param_vals = [5, 10, 15, 20, 30, 50, 100, 200, X.shape[1]]
# pca_n_components = [2, 5, 10, 15, 20, 30, 40, 50, 100, 200]

# kbest_param_vals = [16]
# pca_n_components = [2]

kbest_param_vals = [5]
pca_n_components = [50]

for kbest in kbest_param_vals:
    start = time()
    #choose kbest feature dimensions
    f_selector = SelectKBest(k=kbest)
    X_slctd = f_selector.fit(X, y['Activity']).transform(X)
    f_selected_indices = f_selector.get_support(indices=False)
    X_slctd_cols = X.columns[f_selected_indices]
    
    #transform these features to another space where they are less skewed
    X_slctd_tformed = preprocessing.MinMaxScaler(feature_range=(1, 2), copy=True).fit_transform(X_slctd)
    X_slctd_tformed = boxCoxData(X_slctd_tformed)
    X_slctd_tformed = preprocessing.MinMaxScaler(feature_range=(-1, 1), copy=True).fit_transform(X_slctd_tformed)
    X_slctd_tformed = pd.DataFrame(data=X_slctd_tformed, index=range(X_slctd_tformed.shape[0]), columns=X_slctd_cols)
    end = time()
    
    for pca_n in pca_n_components:
        column_names = []
        for i in range(pca_n):
            column_names.append("component{:2d}".format(i))
            
        start_pca = time()
        pca = PCA(n_components=pca_n).fit(X)
        X_pcaed = pca.transform(X)
        X_pcaed = pd.DataFrame(data=X_pcaed, index=range(X_pcaed.shape[0]), columns=column_names)
        
        X_combined = pd.concat([X_slctd_tformed, X_pcaed], axis=1)
        end_pca = time()
        t_proc = (end - start) + (end_pca - start_pca)
        results_precision = []
        results_recall = []
        results_fscore = []
        kfold = cross_validation.KFold(X_combined.shape[0], n_folds=10, shuffle=False, random_state=42)
        t_trains = []
        t_tests = []
        for train, test in kfold:
            t_train_s = time()
            clf_SVM.fit(X_combined.iloc[train], y_[train])
            t_trains.append( time() - t_train_s )
            t_test, y_pred = predict(clf_SVM, X_combined.iloc[test])
            t_tests.append(t_test)
            precision, recall, fscore, support = precision_recall_fscore_support(y_[test], y_pred, 
                                                                                 average='weighted')
            results_precision.append(precision)
            results_recall.append(recall)
            results_fscore.append(fscore)

        printout = "(kbest{:3d})(pca_n{:3d})".format(kbest, pca_n)
        printout += "  precision: {:.2f}".format(np.mean(results_precision))
        printout += "  recall: {:.2f}".format(np.mean(results_recall))
        printout += "  fscore: {:.2f}  ".format(np.mean(results_fscore))
        printout += "  t_proc: {:.2f} t_train: {:.2f} t_test: {:.2f}".format(t_proc, np.mean(t_trains), np.mean(t_tests))
        print printout
(kbest  5)(pca_n 50)  precision: 0.94  recall: 0.93  fscore: 0.93    t_proc: 1.13 t_train: 0.86 t_test: 0.19
In [97]:
from sklearn.grid_search import GridSearchCV
from sklearn.cross_validation import StratifiedShuffleSplit

param_grid = [
  {'C': [1, 10, 100, 1000], 'kernel': ['linear']},
  {'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
 ]

kfold = cross_validation.KFold(X_combined.shape[0], n_folds=10, shuffle=False, random_state=42)
grid = GridSearchCV(SVC(), param_grid=param_grid, cv=kfold)
grid.fit(X_combined, y_)

print("The best parameters are %s with a score of %0.2f"
      % (grid.best_params_, grid.best_score_))
The best parameters are {'kernel': 'rbf', 'C': 1000, 'gamma': 0.001} with a score of 0.93
In [100]:
print grid.grid_scores_
[mean: 0.92582, std: 0.02688, params: {'kernel': 'linear', 'C': 1}, mean: 0.92193, std: 0.02757, params: {'kernel': 'linear', 'C': 10}, mean: 0.91960, std: 0.02925, params: {'kernel': 'linear', 'C': 100}, mean: 0.91873, std: 0.02919, params: {'kernel': 'linear', 'C': 1000}, mean: 0.90601, std: 0.02847, params: {'kernel': 'rbf', 'C': 1, 'gamma': 0.001}, mean: 0.85669, std: 0.03313, params: {'kernel': 'rbf', 'C': 1, 'gamma': 0.0001}, mean: 0.92533, std: 0.02871, params: {'kernel': 'rbf', 'C': 10, 'gamma': 0.001}, mean: 0.90659, std: 0.02807, params: {'kernel': 'rbf', 'C': 10, 'gamma': 0.0001}, mean: 0.92854, std: 0.02690, params: {'kernel': 'rbf', 'C': 100, 'gamma': 0.001}, mean: 0.92426, std: 0.02987, params: {'kernel': 'rbf', 'C': 100, 'gamma': 0.0001}, mean: 0.93009, std: 0.02460, params: {'kernel': 'rbf', 'C': 1000, 'gamma': 0.001}, mean: 0.92669, std: 0.02769, params: {'kernel': 'rbf', 'C': 1000, 'gamma': 0.0001}]

It is now time to optimize training parameters of the models while incorporating best set of features. My hypothesis is that this will yield in the best classification performance.[NOTE: TO BE CONTINUED]

In [22]:
from sklearn.metrics import r2_score

def performance_metric(y_true, y_predict):
    return r2_score(y_true, y_predict)

from sklearn.metrics import make_scorer
from sklearn.grid_search import GridSearchCV
from sklearn.cross_validation import ShuffleSplit

def fit_model(X, y):

    # Create cross-validation sets from the training data
    cv_sets = ShuffleSplit(X.shape[0], n_iter = 10, test_size = 0.20, random_state = 0)
    params = {'max_depth': range(1,20)}
    
    # TODO: Transform 'performance_metric' into a scoring function using 'make_scorer' 
    scoring_fnc = make_scorer(performance_metric)

    regressor = DecisionTreeRegressor(max_depth = params['max_depth'], random_state=42)
    grid = GridSearchCV(regressor, param_grid=params, scoring=scoring_fnc)
    grid = grid.fit(X, y)
    return grid.best_estimator_

# clf = fit_model(X,y)
clf = fit_model(X_train,y_train)

print clf.score(X_train, y_train)
print clf.score(X_test, y_test)
0.190959174602
-7.59452573956